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Abstract 

This paper is concerned with a semiparametric partially linear regression model with 
unknown regression coefficients, an unknown nonparametric function for the non-linear 
component, and unobservable Gaussian distributed random errors. We present a wavelet 
thresholding based estimation procedure to estimate the components of the partial linear 
model by establishing a connection between an l\ -penalty based wavelet estimator of the 
nonparametric component and Huber's M-estimation of a standard linear model with outliers. 
Some general results on the large sample properties of the estimates of both the parametric and 
the nonparametric part of the model are established. Simulations and a real example are used 
to illustrate the general results and to compare the proposed methodology with other methods 
available in the recent literature. 

Keywords: Semi-nonparametric models, partly linear models, wavelet thresholding, backfitting, M- 
estimation, penalized least-squares. 

1 Introduction 

Assume that responses y\,...,y n are observed at deterministic equidistant points f, = ^ of an 
univariate variable such as time and for fixed values X„ i = 1, . . . , n, of some p-dimensional 
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explanatory variable and that the relation between the response and predictor values is modeled 
by a Partially Linear Model (PLM): 

y i = Xjp + f(t i ) + u i i = l,...,n, (1) 

where /5 is an unknown p-dimensional real parameter vector and /(•) is an unknown real- valued 
function; the u/s are i.i.d. normal errors with mean and variance a 2 and superscript "T" denotes 
the transpose of a vector or matrix. Given the observed data (j/i,X,) != i / „, the aim is to estimate 
from the data the vector /3 and the function /. 

The interest in partial linear models has grown significantly within the last decade since their 
introduction by Engle et al. (1986) to analyze in a nonlinear fashion the relation between electricity 
usage and average daily temperature. Since then the models have been widely studied in the 
literature. The recent monograph by Hardle et al. (2000) provides an excellent survey on the theory 
and applications of the model in a large variety of fields, such as finance, economics, geology and 
biology, to name only a few. The advantages of such a model is that it allows an adequate and 
more flexible handling of the explanatory variables than in linear models and can be also serve 
as a starting point for dimension reduction by additive modeling. Although there is still lack of 
general theory on testing the goodness-of-fit of a partial linear model, there are some consistent 
specification tests such as, for example, those developed by Chen and Chen (1991). 
Until now, several methods have been proposed to analyse partially linear models. One approach 
to estimation of the nonparametric component in these models is based on smoothing splines 
regression techniques and has been employed in particular by Green and Yandell (1985), Engle et 
al. (1986), Rice (1986), Chen (1987), Chen and Shiau (1991), and Schick (1996) among others. Kernel 
regression (see e.g. Speckman (1988)) and local polynomial fitting techniques (see e.g. Hamilton 
and Truong (1997)) have also been used to study partially linear models. An important assumption 
by all these methods for the unknown nonparametric component f(t) is its high smoothness. But 
in reality, such a strong assumption may not be satisfied. To deal with cases of a less-smooth 
nonparametric component, a wavelet based estimation procedure is developed in this paper, and 
as such it can handle nonparametric estimation for curves lying in Besov spaces instead of the more 
classical Sobolev spaces. 

The estimation method developed in this paper is based on a wavelet expansion of the 
nonparametric part of the model. The use of an appropriate thresholding strategy on the 
coefficients allows us to estimate in an adaptive way the nonparametric part with quasi-minimax 
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asymptotic rates without restrictive assumptions on its regularity. To our knowledge, only few 
developments in the use of nonlinear wavelet methods in the context of PLM models exist in 
the literature. Wavelet based estimators for the nonparametric component of a PLM have been 
investigated by Meyer (2003), Chang and Qu (2004) and by Fadili and Bullmore (2005), more 
recently. Our results will be compared to the later, since the settings adopted in their work are 
relatively similar to ours. 

One novelty of the estimation procedure proposed in this paper is the link between wavelet 
thresholding and classical robust M-estimation schemes in linear models with outliers: using soft 
or hard thresholding or even a SCAD thresholding (see Antoniadis and Fan (2001)) amounts in 
estimating respectively the unknown vector j6 of the linear part in the model by Huber's M- 
estimation or by a truncated mean or by Hampel's estimator. This link allows us to investigate 
the asymptotic minimax properties of the estimators and to derive second-order approximations 
for the bias and variance of the resulting estimators of /5 Q . This is essentially due to the fact that the 
nonparametric part of the model has a sparse wavelet coefficients representation, and the wavelet 
coefficients of a PLM in the wavelet domain appear then as outliers in the linear model composed 
by the linear part. 

Furthermore, the above established link of our method with M-estimation theory offers the 
possibility to use specific M-estimation algorithms for numerically implementing the proposed 
method, instead of using the backfitting technique proposed by Fadili and Bullmore (2005). For our 
numerical implementation, we will adopt a class of half-quadratic optimization algorithms that 
have been developed recently for robust image recognition in the pattern recognition literature (see 
e.g. Charbonnier et al. (1997), Dahyot and Kokaram (2004), Vik (2004) and Nikolova and Ng (2005)). 
The organization of this paper is as follows: Section 2 briefly recalls some relevant facts about the 
wavelet series expansion and the discrete wavelet transform that we need further and presents the 
wavelet decomposition used to model the observed partial linear model. In section 3, we establish 
the connection between wavelet thresholding estimation for the PLM and M-estimation for a linear 
model. Section 4 establishes the main properties of our estimators. In Section 5, we discuss the 
computational algorithms that are used for the numerical implementation of our procedures where 
we also present a small simulation study to illustrate the finite sample properties of our procedures 
and to compare them to the backfitting algorithm proposed by Fadili and Bullmore (2005). Proofs 
of our results are given in Appendix. 
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2 The partly linear model and its wavelet transform 



2.1 The Setup 

Suppose that y ; (i = 1,2, . . . , n) is the i-th response of the regression model at point f , (where t is an 
index such as time or distance) and can be modelled as 

y i = XifS Q +f(t i )+u i , (2) 

where Xj are given p x 1 vectors of covariate values, £, = ^ and ]S and / are respectively the 
parametric and nonparametric components of the partial linear model. We will assume hereafter 
that the noise variables «, are i.i.d. Gaussian Af(0,a 2 ) and that the sample size n = 7) for some 
positive integer /. 

In the nonparametric analysis, the nonparametric part / is modeled as a function lying in an 
infinite dimensional space. The underlying notion behind wavelet methods is that the unknown 
function has an economical wavelet expression, i.e. / is, or is well approximated by, a function 
with a relatively small proportion of nonzero wavelet coefficients. An approach to modelling 
the nonparametric component of the PLM model, that allows a wide range of irregular effects, 
is through the sequence space representation of Besov spaces. The (inhomogeneous) Besov spaces 
on the unit interval, B„ r ([0, 1]), consist of functions that have a specific degree of smoothness in 
their derivatives. The parameter n can be viewed as a degree of function's inhomogeneity while 
s is a measure of its smoothness. Roughly speaking, the (not necessarily integer) parameter s 
indicates the number of function's (fractional) derivatives, where their existence is required in an 
L 71 -sense; the additional parameter r is secondary in its role, allowing for additional fine tuning 
of the definition of the space. For a detailed study on (inhomogeneous) Besov spaces we refer to, 
e.g., D. L. Donoho and Johnstone (1998). To capture key characteristics of variations in / and to 
exploit its sparse wavelet coefficients representation, we will assume that / belongs to B s nr ([0, 1]) 
with S + 1/7T — 1/2 > 0. The last condition ensures in particular that evaluation of / at a given 
point makes sense. 

We now briefly recall first some relevant facts about the wavelet series expansion and the discrete 
wavelet transform that we need further. 



4 



2.2 The wavelet series expansion and the discrete wavelet transform 

Throughout the paper we assume that we are working within an orthonormal basis generated 
by dilatations and translations of a compactly supported scaling function, f(t), and a compactly 
supported mother wavelet, ip(t), associated with an r-regular (r > 0) multiresolution analysis of 
(L 2 [0, 1], (•, •)), the space of squared-integrable functions on [0, 1] endowed with the inner product 
(/, g) = Jj x j f(t)g(t) dt. For simplicity in exposition, we work with periodic wavelet bases on [0, 1] 
(see, e.g., Mallat (1999), Section 7.5.1), letting 

<iW = E#-') and = Ety*( f -0, for ^[o,i], 

where fj k (t) = 2! /2 cp(2H — k) and tfyt(t) = 2> /2 ip(2H — k). For any given primary resolution level 
jo > 0, the collection 

{q>l k , = 0,1, . . . > - 1; tf k , j > jo; k = 0,1, . . . ,2> - 1} 

is then an orthonormal basis of L 2 [0, 1]. The superscript "p" will be suppressed from the notation 
for convenience. Despite the poor behavior of periodic wavelets near the boundaries, where 
they create high amplitude wavelet coefficients, they are commonly used because the numerical 
implementation is particular simple. Therefore, for any / G L 2 [0, 1], we denote by Cj ok = (/, fj ok ) 
(k = 0,1, . . . ,2° - 1) the scaling coefficients and by d jk = (/, ip jk ) (j > ; ; k = 0, 1, . . . ,2> - 1) the 
wavelet coefficients of / for the orthonormal periodic wavelet basis defined above; the function / 
is then expressed in the form 

2»-l co 2>-l 

/(0 = E c iok<Pjok(t) + E E W;Tt(0, f e [0,1]. 

lc=0 ;=; fc=0 

The approximation space spanned by the scaling functions {<p /0 )c, k = 0, 1, ... ,2'° — 1} is usually 
denoted by Vy while the details space at scale ;', spanned by {ipp k = 0,1, ... ,1) ' — 1} is usually 
denote by Wy. 

In statistical settings, we are more usually concerned with discretely sampled, rather than 
continuous, functions. It is then the wavelet analogy to the discrete Fourier transform which is of 
primary interest and this is referred to as the discrete wavelet transform (DWT). Given a vector of 
real values e = (e\, . . . , e n ) T , the discrete wavelet transform of e is given by d = W nxn e, where d is 
an n x 1 vector comprising both discrete scaling coefficients, Sj ok , and discrete wavelet coefficients, 
Wj k , and W„ x « is an orthogonal n x n matrix associated with the orthonormal periodic wavelet basis 
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chosen. In the following we will distinguish the blocs of W nxn spanned respectively by the scaling 
functions and the wavelets. The empirical coefficients Sj^ and of e are given by 

1 " 

Sj ,k ~ —j= E e i<Pk,k^i) for k = 0/ • • • / 210 ~ 1 
V n ,-=i 

f 

] = ;o/. --//-I/ 



v" i=l 



fc = 0,...,2>-l. 



When e is a vector of function values F = (f{h), ...,f(t n )) T at equally spaced points f,-, the 
corresponding empirical coefficients sy fc and h;^ are related to their continuous counterparts 
and djk (with an approximation error of order n _1 ) via the relationships S/ fc ~ \fnc^ and 
w /jfc ~ V*idjk. Note that, because of orthogonality of W nx «/ the inverse DWT (IDWT) is simply 
given by F = W T xn d, where W^ Xf! denotes the transpose of W„ x „. If n = l) for some positive 
integer /, the DWT and IDWT may be performed through a computationally fast algorithm (see, 
e.g., Mallat (1999), Section 7.3.1) that requires only order n operations. 

We will further use the following notation. For a n-dimensional vector e, its Euclidian (or h) norm 

(Ef=i e j) 1/Z wm be denoted by ||e|| and the Frobenius norm of a matrix B with general entries bu 

,, ,, / ?\ 1/2 
will be denoted by 1 1 B \ \ = ( ; - bfj j 

2.3 A wavelet-based model specification of the PLM model 

In matrix notation, the PLM model specified by (2) can be written as 

Y = X)8 + F + U, (3) 

where Y = (yi,...,i/„) > xT = (Xi,...,X„) is the p x n design matrix, and F = 

{j{t\) r . . ./(in)) • The noise vector U = (u\,. . .,u n ^ is a Gaussian vector with mean and 
variance matrix <7 2 I n . 

For the model to be asymptotically identifiable, we will assume: 
(Al) The vector ^X T F tends to as n goes to infinity. 

(A2) The matrix X is full rank, i.e. ±X T X converges towards an invertible matrix. 
Expressing the vector of coefficients of the linear part as 

& = G xTx ) xT ( Y - p - u )' 



clearly shows that conditions (Al) and (A2) are sufficient to asymptotically ensure the identifiability 
of the PLM model. As it will be seen in the Appendix, none of these assumptions is restrictive. 
Let now Z = W nxn Y, A = W nxn X, O = W nxn F and £ = W nxn U. Then premultiplying (1) by W, 
we obtain the transformed model 

Z = Ap + O + £■ (4) 

The orthogonality of the DWT matrix W nxn ensures that the transformed noise vector £ is still 
distributed as a Gaussian white noise with variance a 2 l n . Hence, the representation of the model 
in the wavelet domain not only allows to retain the partly linear structure of the model but also 
to exploit in an efficient way the sparsity of the wavelet coefficients in the representation of the 
nonparametric component. 



3 Soft Thresholding and Huber's M-estimation 

The wavelet shrinkage estimators that are classically obtained by hard or soft thresholding can be 
regarded as an extension of the penalized least squares (PLS) estimator (see Antoniadis and Fan 
(2001)). We therefore propose estimating the parameters /5 and 0o in model (4) by penalized least 
squares. To be specific, our wavelet based estimators will be defined as follows: 

(P„,e n )=argmin\ /„(/5,0) = £ \{z { - Af/5 - 0,) 2 + A £ |0,-| 1, (5) 
(£,0) I *=l Z i=k J 

for a given penalty parameter A, where z'o = 2-"> + 1. The penalty term in the above expression 
penalizes only the empirical wavelet coefficients of the nonparametric part of the model and not its 
scaling coefficients. The choice Z 1 of the penalty function produces the soft thresholding rule. 
The regularization method proposed above is closely related to the method proposed recently 
by Chang and Qu (2004), but these authors essentially concentrate on the backfitting algorithms 
involved in the optimization, without any theoretical study of the resulting estimates. The method 
also relates to the recent one developed by Fadili and Bullmore (2005) where a variety of penalties 
is discussed. Note, however, that their study is limited to quadratic penalties which amounts 
essentially in assuming that the underlying function / belongs to some Sobolev space and does 
not exploit the sparse representation of /. 

In order to establish the link with Huber's estimation we will have a closer look at the minimization 
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W) = 



(6) 



Pa(") 



of the criterion /„ stated in (5). For a fixed value of ft, the criterion J n (ft, •) is minimum at 

z; — Aj ft if i < z'o, 

sign( Z/ - Af 0) ( | Zi - Af j8| - A) + if i > i . 

Therefore, finding ft n , a solution to problem (5), amounts in finding ft n minimizing the criterion 
J n (d(P), ft). However, note that 

Jn(M),fi) = t,Pl(Zi-Ajfi) (7) 

«=«o 

where p\ is Huber's cost functional with threshold A, defined by: 

u 2 /2 if lul < A, 

(8) 

A|m| - A 2 /2 if \u\ > A. 

The above facts can be derived as follows. Let i > z'o. Minimizing expression (5) with respect to 0, 
is equivalent in minimizing j(8j) := \{zi — Aj ft — 6j) 2 + A|0,|. The first order condition for this is: 
/ (6i) = 9i — (z[ — A] ft) + sign(0j) A = where / denotes the derivative of ;'. Now, 

• if 9i > 0, then = if and only if 0, = z, - Aj ft - A. Hence, if z, - AfjS < A, f = and 
otherwise 0, = z,- — AfjS — A. 

• if 0, < 0, ;'(#;) is zero if and only if Q[ = z; — A ; T j8 + A; therefore, if z, — Aj ft > —A, 0; = and 
otherwise 0; = z,- — A, T )8 + A. 

This proves that for a fixed value of )8, the criterion (5) is minimal for 9{ft) given by 
expression (6). If we now replace 6 in the objective function } n we obtain }„(ft,9(ft)) = 

n 

\Ya (( z ! ~~ A lP ~ ^') 2 + since §i = z i - A lP for * < *o- Now denoting by I the set 

i=«'o 

I := {; = i ,...,n, \z } - Ajft\ < A}, we find that J n (ft,e(ft)) = £O z i ~ A J ^ + iL^ 2 + 



A^ (|z,- — Aj ft\ — A) by replacing 0, with (6), which is exactly Huber's functional. 



The mathematical equivalence of the solution of the two classes of estimation can be stated in the 
following proposition. 

Proposition 1. lfft n and 9„ are solutions of the optimization problem (5), then they satisfy 

(9) 



ft n = argmin £ px{z t - Ajft), 



t=t 



9i, n = { 



z i ~ A Jh tfi < l 'o 

k 7so ft,A {zi - Ajft n ) ifi> io, 



i = l,...,n, 



(10) 
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with px being Ruber's cost functional defined in (8) and y so ft,\ the soft-thresholding function with threshold 
A defined by j S oft,A( u ) = sign(u) (\u\— A) + . 

This result allows the computation of the estimators $ n et n in a non-iterative fashion. We can 
estimate the parameter )S directly from the observed data without caring about the nonparametric 
part of the model by means of eq.(9), and then determine 6 n , thence F n using eq.(10). 
The resulting form of the estimators allows us to study their asymptotic properties. Moreover, as 
we shall see in the simulation section of this paper, another benefit is that we can design estimation 
algorithms that are much faster than those based on backfitting. Lastly, Propostion 1 leads to a nice 
interpretation of the estimators. 

We may summarize the estimation procedure as follows. Using the observed data (Y, X) : 

1. Apply the DWT of order / = log 2 (n) on X and Y to get their corresponding representation A 
and Z in the wavelet domain. 

2. The parameter j6 is then Huber 's robust estimator which is obtained without taking care of 
the nonparametric component in the PLM model, given by the optimization problem (9). 
In other words this amounts in considering the linear model z,- = Aj f$ + e, with noise 

e t = 6 0i + 

3. The vector 9 of wavelet coefficients of the function / is estimated by soft thresholding of 
Z — Afi n , i.e. by equation (10). The estimation of / is then obtained by applying the 
inverse discrete wavelet transform. Note that this last step corresponds to a standard soft- 
thresholding nonparametric estimation of / in the model: 

\ji - Xjfi n = f(ti) + v u i = l,...,n, 

where v { = Xf (j6 - /3J + u t . 

Remark 1. The above estimation procedure is in phase with the one advocated by Speckman (1988) who 
suggests that it is usually preferable to estimate first the linear component in a PLM and to then proceed to 
the estimation of the nonparametric one. Indeed, we propose to estimate {S and F by: [S = (X T S T SX)~ 1 SY 
and F = ( I — S) (Y — X/$), with S = (I — T)W,T being the threshold operator. We recognize the exact same 
form as those of Speckman (1988), differing only on the fact that the smoothing operator S is not anymore 
linear. 
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The wavelet soft-thresholding procedure proposed in this section was derived by establishing 
the connection between an l\ based penalization of the wavelet coefficients of / and Huber's M- 
estimators in a linear model. Other penalties, leading to different thresholding procedures can 
also be seeing as M-estimation procedures. For example, if 7 a denotes the resulting thresholding 
function, we can show in a similar way that the estimators verify 



'i,n 



u 



argmin ^ p x {z t - AfjS), 



! = ! 



z { - A/ j6 if i < z , 
7a(z;-A[/3) if2>z , 



1, . . . , n, 



with Pa being the primitive of u i— > u — 7 A («). From what precedes, one sees that hard thresholding 
corresponds to mean truncation, while SCAD thresholding is associated to Hampel's M-estimation. 
The above thresholding procedures and the corresponding criteria are illustrated in Figure 1. 



Thresholding 
functions 
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cost functions 

-0.5 
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Huber estimation 



Hampel estimation 



Truncated mean 



Figure 1: Link between different thresholdings and M-estimation. The dashed line displays the 
least squares criterion. 

However, in this paper, we only concentrate on the properties of estimators obtained by soft 
thresholding, those corresponding to other rules presenting avenues for further research that hope 
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will be addressed in the future. 

4 Asymptotic properties 

Huber's M-estimation was introduced as an alternative to least squares in order to limit the 
sensitivity of the least-squares estimates to each individual observation. While Huber's M- 
estimators do not have finite breakdown points, one can show they are quite robust to outliers 
(see e.g. Hampel et al. (1986)). Huber's M-estimation appears therefore a natural approach for 
robustly fitting the linear part of a PLM, interpreting the wavelet coefficients of the nonparametric 
part as outliers. In what follows, relying upon this analogy, we study the asymptotic properties of 
our estimator. To establish our asymptotic results we will require several assumptions to hold. 

First a condition which ensures the unicity of fi n defined in (9): 

(A3) The series (K n ) defined by K n := \ T!i=i ^i^Jp'K^Oi + converges in the L 2 -norm towards 
a non-singular matrix Kq. 

The next assumption deals with the structure of the regression design matrix. Since the dicrete 
wavelet transform W is orthogonal it follows that A T A = X T X and, therefore when (A2) holds 
the matrix A 1 A is non-singular for n sufficiently large. Consequently, the projection matrix on 
the space spanned by the columns of A, say H = A(A T A)~ 1 A T , has a rank p. In such a case, if 
(h\, . . . ,h n ) denotes the diagonal of H, the equality J^h, ■ = p holds. With regards to the design, we 
will also use the assumption: 

(A4) The quantity h := max Af (A t A) _1 A, tends to when n goes to infinity. 

i=l,...,n 

Assumption (A4) is common in a robust regression framework, validating among other things the 
use of the Lindeberg-Feller criterion. The only difference in our case is that the regression matrix 
that we consider is the wavelet transformed A rather than X, but the relevant discussion in the 
Appendix shows that such an assumption is reasonable. 

Existing results for semi-parametric partial linear models establish parametric rates of convergence 
for the linear part and minimax rates for the nonparametric part, showing in particular that the 
existence of a linear component does not changes the rates of convergence of the nonparametric 
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component. Within the framework adopted in this paper, the rates of convergence are similar, but 
an extra logarithmic term will appear in the rates of the parametric part, mainly due to the fact that 
our smoothness assumptions on the nonparametrric part are weaker. We are now in position to 
give our asymptotic results. 

Theorem 1. Let fi n and 6 n be the estimators defined by (9,10) in the model (1). Consider that the penalty 
parameter A is the universal threshold: A = a^2log(n). Under assumptions (Al)- (A4), we have 



If in addition we assume that the scaling function <p and the mother wavelet ip belong to C R and that xp 
has N vanishing moments, then, for f belonging to the Besov space B s n r with < s — 1/2 + 1/n and 
1/n < s < min(_R, N), we have 



estimation of the vector of parameters /5. The presence of a logarithmic loss lies on the choice of 
the threshold A: taking A which tends to 0, as suggested by Fadili and Bullmore (2005), would lead 
to a minimax rate in the estimation of ft. The drawback is that the quality of the estimation for the 
nonparametric part of the PLM would not be anymore quasi-minimax. This phenomenon was put 
in evidence by Rice (1986): a compromise must be done between the optimality of the linear part 
estimation with an oversmoothing of the functional estimation and a loss in the linear regression 
parameter convergence rate but a correct smoothing of the functional part. 

The method of estimation that we propose leads to quasi-minimax convergence rates and is 
applicable for a large class of functions /. An important remark is also that our procedure is 
adaptative relatively to the regularity of /, thanks to the use of threshold techniques in the wavelet 
decomposition. Note also that Theorem 1 give a Bahadur's representation of fi n , allowing to 
elaborate appropriate testing procedures; such inferential problems are out of the scope of the 
present paper, but interesting for future work. 





where\\f n -f\\l = J \f n -f) 2 . 



The Theorem is proved in the Appendix. As noted previously, we lose a factor yTog(n) in the 



12 



4.1 Estimation of the variance 



Our estimation procedure relies upon knowledge of the variance o" 



2 of the noise, appearing in 



the expression of the threshold A (recall that we have adopted the universal threshold: A = 
cry/2 log(n)). In practice, this variance is unknown and needs to be estimated. One could estimate 
a 1 in an iterative way, i.e. with a backfitting algorithm. We propose instead a direct method of 
estimation based on a QR decomposition of the linear part. 

In wavelet approaches for standard nonparametric regression, a popular and well behaved 
estimator for the unknown standard deviation of the noise is the median absolute deviation (MAD) 
of the finest detail coefficients of the response divided by 0.6745 (see D. Donoho et al. (1995)). The 
use of the MAD makes sense provided that the wavelet representation of the signal to be denoised 
is sparse. However, such an estimation procedure cannot be applied without some pretreatment of 
the data in a partially linear model because the wavelet representation of the linear part of a PLM 
may be not sparse. Indeed, in practice we have observed that for many partly linear models such a 
procedure leads to biased estimations. 

A QR decomposition on the regression matrix of the PLM allows to eliminate this bias. Since often 
the function wavelet coefficients at weak resolutions are not sparse, we only consider the wavelet 
representation at level / = log 2 (n). Let Aj be the wavelet representation of the design matrix X 
at level /. The QR decomposition ensures that there exist an orthogonal matrix Q and an upper 
triangular matrix R such that 



If Zj, 0o,j and Ej denote respectively the vector of the wavelets coefficients at resolution / of Y, F 
and U, model (4) gives 



It is easy to see that applying the MAD estimation on the last components of Q T zj rather than on 
zj will lead to a satisfactory estimation of cr. Indeed thanks to the QR decomposition the linear part 
does not appear anymore in the estimation and thus the framework is similar to the one used in 
nonparametric regression. Following D. Donoho et at (1995), the sparsity of the functional part 
representation ensures good properties of the resulting estimator. 
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5 Simulation study 



The purpose of this section is to study through simulations several algorithms for estimating the 
linear part of a PLM model but also to evaluate the performance of the proposed estimators. Our 
wavelet estimation method for PLM will be also compared with a wavelet backfitting algorithm 
proposed by Fadili and Bullmore (2005). As already noted, our estimation method allows us to first 
estimate the linear regression parameter vector j8 independently of the nonparametric part, and 
to then proceed to the estimation of the functional part of the PLM model. The M-estimation fi n 
of j$ is obtained by means of iterative optimization procedures that are more or less efficient, but 
usually much faster than backfitting procedures, as we shall see. Before proceeding to the analysis 
of our simulation results, we briefly recall two particular optimization algorithms that may be used 
for estimating the linear part. 

5.1 Half-quadratic algorithms 

The minimization problem we have to solve is of the form: 

j8 n = argmin JQ8) with /QS) = £> A ( Zf - - Af/5). (11) 

Minimizers of J(/J) can be obtained using standard optimization tools such as relaxation, gradient, 
conjugated gradient and so on, but even if the loss function is convex, its second derivative 
is large near to zero, so the optimization may be slow. For this reason, specialized optimization 
schemes have been conceived. A very successful approach is half-quadratic optimization, proposed 
in Geman and Reynolds (1992) and Geman and Yang (1995) for cost functions of the above form. 
The idea is to associate with every )8 in (11) an auxiliary variable c and to construct an augmented 
criterion K, such that for every c fixed, the function )8 — » K(j6, c) is quadratic (hence quadratic 
programming can be used) whereas for every )S fixed, each c can be computed independently 
using an explicit formula. The augmented criterion K is chosen to have the same minimum as 
/, attained for the same value of /5. The optimization problem of the augmented energy can be 
solved iteratively. At each iteration one realizes an optimization with respect to (1 for c fixed and 
a second with respect to c for )8 fixed. More precisely, if and c( m ) are the values given after m 
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iterations, the (m + l) th step of the algorithm actualizes these values through: 

p^+V = argminK(p,cW) 

P (12) 
c (m+i) = argminK(^ m+1 \c) 

c 

This procedure leads to two algorithms, namely ARTUR and LEGEND, that are also referenced in 
the literature as IRLS and IMR. We refer to Nikolova and Ng (2005) for some theory on the their 
use with Huber M-estimation. These algorithms are used for example in robust recognition (see 
e.g. Charbonnier et at (1997), Dahyot et al. (2004) or Vik (2004)). Vik (2004) in particular stresses 
the link between ARTUR and LEGEND and Huber 's approach. 

ARTUR 

The algorithm described hereafter is referenced as the ARTUR algorithm in the optimization 
literature or as Iterative Reweighted Least Squares (IRLS) in the robustness literature. Geman and 
Reynolds's theorem leads to an augmented criterion of the form 

X( j S,c) = £c ! (zi-A[ i S) 2 + Y(c). 

z=l 

The auxiliary variable c corresponds to a weight on the residuals of the least squares fit, thus 
explaining the IRLS terminology. Intuitively, weights on large residuals have a tendency to 
eliminate the corresponding responses from the fit. For fixed, the minimum is reached for 
c i = where r, is the z'th residual r, = z; — Aj j6. At this point the value of Y is p\ (r,-) — p' A (r,- ) r, / 2. 
The m + 1 step of the ARTUR algorithm can therefore be described as follows: 



= Z; 



^ = ^ V ie{ l n } 



2r 



0(m+i) = (A T c( m+1 )A)- 1 A T c( m+1 )z 



LEGEND 



LEGEND, or Iterative Modified Residuals (IMR), is a slightly different algorithm. The auxiliary 
variable doesn't weight the residuals anymore but subtracts the larger values of the residuals 
instead. The existence of the corresponding augmented energy functional follows from the second 
theorem of Geman and Reynolds (1992). The criterion to be minimized can be written as 

K(p / c)=f^(z i -Ajp-c i ) 2 + ac). 

i=i 
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For /S fixed, the minimum is reached for c, = ri (l — ^7-^) where r, ith residual r, = z,- — AfjS and 
at this point the function £ takes the value p\(ri) — p' A (ri) 2 /4. 

With similar notation as for the ARTUR algorithm, the m + 1 step of the LEGEND algorithm can be 
described as follows: 

r(«) = Z-A)6 (m) 

< cj m+1 ) = r ^(l-^) Vie{l n} 

0(m+i) = (A T A)- 1 A T (Z-c( m + 1 )) 

Both ARTUR and LEGEND are very easy to program. Nikolova and Ng (2005) show that the 
risk obtained via the multiplicative form ARTUR is always smaller than the one obtained via the 
additive form, but the later one is numerically faster. The main reason for this is that under the 
multiplicative form a matrix inversion is performed within each iteration. 

5.2 Numerical simulations 

In this subsection, we give some simulation results. All the calculations were carried out in 
MATLAB 7.0 on a unix environment. For the DWT, we used the WaveLab toolbox developed by 
Donoho and his collaborators at the Statistics Department of Stanford University (http:/ /www- 
stat.stanford.eduAwavelab). For each of the simulated examples in the sequel, we may summarize 
the various ingredients of our fitting procedure as follows: 

1. Application on the observed data of the discrete wavelet transform (DWT) using the 
pyramidal algorithm of Mallat (1989); 

2. Estimation of the variance a 2 by means of a QR decomposition on the matrix of wavelet 
coefficients at maximal resolution followed by a MAD estimation; 

3. Estimation of j8 with ARTUR or LEGEND, solving (9); 

4. Estimation of Qq by soft thresholding of Z — A/J n , given by (10); 

5. Finally, estimation of /„ by applying the inverse DWT on 9 n . 

We will compare with Fadili and Bullmore's procedure that estimates conjointly j8 and 9q using a 
backfitting algorithm. 

In order to reduce the number of iterations, we have used a stopping criterion in both ARTUR 
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et LEGEND: while fixing a larger upper bound for the total number of iterations allowed, we 
also consider that the algorithm has converged as soon as the difference between two successive 
iterations is smaller than some given threshold 5. More precisely, the iterations are stopped as soon 

as — 7-f- — — < 5 or whenever we attain their upper limit. 

||jS (m) || 2 vv 

For illustration, we generated three test problems as follows. The nonparametric component /o 
was selected among two different functions, one sinusoidal function and one piecewise constant 
function. The covariate is chosen as X,- = g{i/n) + tjj with polynomial functions g and with the 
(Vi)i=i,...,n generated independently from a centered distribution with finite variance, as explained 
in Section 6. For DWT, the filter we used is the Daubechies Symmlet filter with 8 vanishing 
moments. The sample size we took was n = 2 8 . For each setting, 500 replicates of data with 
different X and u were generated. The variance of the noise was chosen such as the signal-to-noise 
ratios of the nonparametric and parametric component respectively were equal to 2.2 and 4.38. 
Such choices seem reasonable. With the simulated data, we then used the proposed algorithms 
to estimate the unknown parameters. For wavelet thresholding the universal threshold was used, 
while the termination tolerance 5 was set to 10~ 5 for ARTUR and 10~ 10 for LEGEND. For Backfitting, 
we have used the algorithm of Fadili and Bullmore (2005) with a tolerance level 5 equal to 10~ 20 . To 
save computational time we have also specified an upper limit of 2000 for the maximum number 
of iterations allowed. 

Example 1: Sinusoidal test function 

In examples 1 and 2, the covariate was generated using the polynomial function g(t) = t 5 + 2t 
and with the (j/i)i=i,...,n generated independently from N(0, 1). We have also run some numerical 
simulations with different design functions g such as g(t) = 2 f , g(t) = e~ f2 or g(t) = cos(t) with 
similar results, not reported here by the lack of space. It seems that assumption (A4) is not really 
necessary for asymptotic consistency. 

We first consider the case of a sinusoidal function for the nonparametric part. In such a case one 
could obviously use smoothing splines based semiparametric estimation but it is interesting to see 
how our wavelet based procedure behaves. Figure 2 displays the wavelet transform of the data 
and of the design matrix. Note that the sparse representation of the nonparametric part allows an 
efficient reduction of the bias between the observations and a linear model. The dashed lines in the 
plot displayed in Figure 2, represent the lines X,/? ± A. Observations lying far out from these lines 
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Figure 2: Wavelet transform of the data. Figure (a) represents the scatter plot of the observations 
versus the values of the covariates Xj. The line is the linear part of the model, of equation y, = X,j8 . 
Figure (b) is the scatter plot in (a) after the Discrete Wavelet Transform: it represents the coefficients 
Z\ versus A;. The solid line is the linear part of the model (equation z, = A)8 ) and the dashed lines 
are the lines of equations z, = A,/S ± A. 

We now evaluate the effect of the QR decomposition on the estimation of the noise, and we 
compare the computational time required by each of the algorithms, namely ARTUR, LEGEND 
and Backfitting over the 500 replications of the experiment. 



Estimation of a by MAD 
True value without QR with QR 

0.5 1.2222(0.0955) 0.5023(0.0511) 

Table 1 : The mean values of the estimates and their standard deviation over the 500 simulations in Example 
1 with n = 2 8 (the standard deviation appears in brackets). 



From Table 1, we get a fairly good impression on the effect of the QR decomposition on the 
estimation of the noise variance: the presence of the linear part introduces a strong bias in the 
MAD estimator, bias which is strongly diminished when using the QR decomposition. This also 
explains why in the comparison of their various thresholded estimators, Fadili and Bullmore (2005) 
often obtain estimators that are over-smoothed, since the variance that is used in their thresholds 
is over estimated. To be fair, we therefore have adopted for all methods the universal threshold 
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A = u^jT. log(n) with a estimated by MAD after a QR decomposition. 



Estimation of /S 


True value 


Backfitting ARTUR 


LEGEND 


1 


0.9000(0.0273) 0.9417(0.0327) 


0.9417(0.0327) 


Average computing time 


0.0936 0.0232 


0.0151 



Table 2: The mean values of the estimates and their standard deviation over the 500 simulations in Example 
1 (standard deviation appears in brackets) with n = 2 8 . The average MISE for the nonparametric part for 
these simulations is 0.1029 for ARTUR and LEGEND and 0.1098 for Backfitting. 

From the last row of Table 2 one can see that both half-quadratic procedures (ARTUR and LEGEND) 
are faster than Backfitting and the quality of estimation of both the parametric and nonparametric 
parts in terms on mean squared error is also better. The differences observed in estimating /3 
between the various procedures is mainly due to the different tolerance levels 5 used by each. 
Note also that Backfitting always stops because the maximal number of iterations is reached. The 
estimation given by Backfitting could be improved but at the cost of a much larger computational 
time. 

Recall that for both half-quadratic based algorithms, once the unknown parameter fi is estimated, 
a nonparametric wavelet based estimation procedure is applied to the resulting residuals y,- — X,-j8 
for estimation of the nonparametric part. Figure 3 displays a typical example of these residuals and 
of the corresponding nonparametric estimation using ARTUR on one replication. 
For the value of the signal-to-noise ratio (SNRf = 2.2) adopted in our simulations for the 
nonparametric part, the estimator does not detect the discontinuity. However it produces results 
very similar to those by standard wavelet denoising of an identical nonparametric signal (without 
a linear part) with the same SNR, supporting our claim that the presence of the linear part in a PLM 
doesn't affect the estimation of the nonparametric part. 

In their numerical implementation of ARTUR et LEGEND, both Vik (2004) and Dahyot and 
Kokaram (2004) conclude that LEGEND converges faster, supporting the theoretical results of 
Nikolova and Ng (2005). To share some light on this fact we have run some simulations with a 
larger number sample size. With n = 2 10 observations and the same signal-to-noise ratio as before 
one can see a clear difference in computational time among the two algorithm for estimators with 
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0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

(a) (b) 

Figure 3: Estimation of the nonpar ametric part in Example 1. Figure (a) represents the residuals obtained 
after estimation of the linear part of the models, meaning z, — AjjfL, and the true functionnal part (dash), in 
Figure (b) we have the resulting estimation of the function (solid) and the true function (dash). 



Estimation 




0.1 02 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 4: A typical partial linear fit from Example 1. The figure represents the scatter plot of the 
observations, the estimated functionnal part (dash) and the parial linear fit (solid) for one of the simulation. 

equivalent qualities, as reported in Table 3. 



Example 2: piecewise linear function 

We would like now to illustrate our estimation procedure when the nonparametric part is highly 
non regular. We thus consider a function fo which is piecewise constant. It is obvious that for such 
a function, our wavelet based procedure is better suited than a spline based procedure. All other 
setting adopted for these simulations are the same as those for example 1. 
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Estimation of j6 


True value ARTUR 


LEGEND 


1 0.9762(0.0127) 


0.9762(0.0127) 


Average computing time 0.2331 


0.0166 


Average number of iterations 7 


59 



Table 3: The mean values of the estimates and their standard deviation over the 500 simulations in Example 

1 (the standard deviation appears in brackets) with n = 2 10 . LEGEND is much faster than ARTUR. 

Estimation of a by MAD 
True value with QR 

0.5 0.49961(0.052741) 

Table 4: The mean values of the estimates and their standard deviation over the 500 simulations in Example 

2 (the standard deviation appears in brackets) for n = 2 8 . 

The results given in Table 5 reinforce our claim from example 1 that half-quadratic algorithms are 
more efficient than Backfitting. Note moreover that the non regularity of the nonparametric part 
does not seem to affect the quality of the estimation of the vector of regression parameters. 



Estimation of /3 for n = 2 8 


True value Backfitting ARTUR 


LEGEND 


1 0.8999(0.0273) 0.9548(0.0309) 


0.9548(0.0309) 


Average computing time 0.0744 0.0209 


0.0139 



Table 5: The mean values of the estimates and their standard deviation over the 500 simulations in Example 2 
(standard deviation appears in brackets). The average MISE for the nonparametric part for these simulations 
is 0.1012 for ARTUR and LEGEND and 0.1078 for Backfitting. 

As in example 1, one can see from Table 5 and Table 6 that LEGEND outperforms ARTUR, and that 

the difference of computing time increases with the number of observations n. 

The estimation of the nonparametric part does not detect the discontinuities of the function. Yet 
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Estimation of /5 for n = 2 10 



True value 



ARTUR 



LEGEND 



0.9554(0.0149) 0.9554(0.0149) 



Average computing time 0.3036 



0.0209 



Table 6: The mean values of the estimates and their standard deviation over the 500 simulations in Example 2 
(standard deviation appears in brackets). The average MISE for the nonparametric part for these simulations 
is 0.0584 for ARTUR and LEGEND. 



iduals after a first step 




Estimation of the nonparamteric part 



■ True function 

_ Estimation within a PLM mode 1 

■ Nonparametric Estimation 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



(a) 



(b) 



Figure 5: Estimation of the nonparametric part in Example 2. Figure (a) represents the residuals obtained 
after estimation of the linear part of the models, meaning z, — A,/5 n , and the true functionnal part (dash). In 
Figure (b) we have the resulting estimation of the function (solid) and the true function (dash). 

compared to standard wavelet denoising in a nonparametric regression model with the same SNR, 
the estimation obtained in the PLM is very similar. The bad visual quality of the estimation results 
from the choice of the signal-to-noise ratio (SNRf = 2.2) adopted in our simulations rather than 
the presence of the linear part. 



Example 3: dimension 4 

We now consider a case where the vector of parameter /5 belongs to M 4 (the dimension of the design 
regression matrix X is then n x 4). The nonparametric part is the same as in example 2, meaning 
that the function is highly irregular. The SNR for the global model was chosen equal to 5.99, with 
a SNR equal to 4.38 for the nonlinear part. One may summarize the results for this example in the 
above tables. 
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Estimation of a by MAD with QR 
True value with QR 

0.5 0.52261(0.053808) 

Table 7: The mean values of the estimates and their standard deviation over the 500 simulations in Example 
3 (the standard deviation appears in brackets). 



Estimation of /S, 



True value 


Backfitting 


ARTUR 


LEGEND 


-1 


-1.4969(0.45822) 


-0.7203(0.461) 


-0.7203(0.461) 


3 


2.8563(0.09770) 


2.9168(0.09941) 


2.9168(0.09941) 





-0.1201(0.33685) 


0.0125(0.34415) 


0.0125(0.34415) 


8 


7.5601(0.16772) 


7.7112(0.18525) 


7.7112(0.18525) 


Mean squared error 


0.8434 


0.5438 


0.5438 


Average computing time 


0.1602 


0.0305 


0.0234 



Table 8: The mean values of the estimates and their standard deviation over the 500 simulations in Example 
3 (the standard deviation appears in brackets) for a given value of the true /3 . The average MISE for the 
nonparametric part for these simulations is 0.2140 for ARTUR and LEGEND and 0.2164 for Backfitting. 

As one can see with computational times that are similar for all procedures, both half-quadratic 
algorithms outperform Backfitting in terms of the MSE. 

As for examples 1 and 2, when the sample size increases, among the half-quadratic algorithms the 
LEGEND one is much faster. 

Conclusion 

This paper develops a powerful penalized least squares estimation in partially linear models, based 
on a wavelet expansion of the nonparametric part. Choosing an appropriate penalty on the wavelet 
coefficients of the function, the procedure leads to an estimation of the linear part of partly linear 
models independent from the nonparametric part, while the estimation of the nonparametric part is 
adaptative relatively to the smoothness of the function. Since the functionnal part of the model has 
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Residuals alter a first step 



Estimation ol the nonparametric part 



■ Function 



■ — ■ True function 

Estimation within a PLM model 

— - Nonparametric estimation 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.6 0.9 1 



(a) 



(b) 



Figure 6: Estimation of the nonparametric part in Example 3. Figure (a) represents the residuals obtained 
after estimation of the linear part of the models, meaning Z\ — A,/5 n , and the true functionnal part (dash). In 
Figure (b) we have the resulting estimation of the function (solid) and the true function (dash). 



- Estimation of the linear part 

Estimation of the functional part 

Estimations of X p+f 




O.f 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 7: A typical partial linear fit from Example 3. The figure represents the scatter plot of the 
observations, the estimated functionnal part (dash) and the partial linear fit (solid) for one of the simulation. 



a sparse representation, the estimation of the regression parameters vector is moreover interpreted 
as a common M-estimation. In the particular case of an / 1 -penalty (leading to soft thresholding and 
Huber's estimator) the near-minimaxity of the estimation of both parametric and nonparametric 
parts of a partially linear model is established, and the result is avalaible for a large class of 
functions, including nonsmooth irregular functions. From an implementation point of view, half- 
quadratic algorithms are proposed that appear to give good results on simulation studies. 
Our ongoing research is focusing on exploring the asymptotic properties of the procedure for other 
thresholding schemes and in more general frameworks such as nonequidistant designs for the 
nonparametric part. 
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6 Appendix 

Appendix A. Discussion of the assumptions. 

In this Section, we study wether the assumptions made in Theorem 1 are reasonable in practice. 
Following Rice (1986) or Speckman (1988) we suppose that the design matrix X can be written 
as a sum of a deterministic function and a noise term. The (z,/)-component of X can be written 
as Xij = gi(tj) + £y with functions gi such that J fgi = and where £y denotes a realization of 
a random variable The variables (£i);=i,...,n are supposed to be independent and identically 
distributed, centered and with finite variance, independent from the u,-. With these notation, 
assumptions (Al), (A2) and (A4) become: 

(Al) The norm of ^X T F can be decomposed as follows: 

ii^Foii 2 = eju (* i?=ift(t;)/(t/) + \ ^mf- 

The convergence towards of the first term is ensured by the assumption that J fgi = for 
all i = 1, . . . , p. We can prove that the second term tends to almost surely. 

Remark 2. When we suppose that Vz, / fgi = 0, this impose that either the integral off is equal to 
zero or the vector l„ x i is not in the space spanned by the columns ofX. This is the usual assumption 
for identifiability in PLM (e.g. Chen (1988) or Donald and Newey (1994)). 

(A2) Let V(g) be the matrix with entries J gigj and V denotes the covariance matrix of the variables 
(£i)i=i,...,n- O ne can prove that ^X T X converges almost surely to V(g) + V. It is sufficient to 
assume that the family (g;);=i,...,n is L 2 -orthogonal in order that the matrix V(g) + V is non 
singular. 



(A4) Actually, it is equivalent to prove that isup||A,|| 2 — > to get (A4). For i G {l,...,n} 

r i 2 

given, i||A,|| 2 is equal to n^AfAj = Y% = i \YI]=\ x Piifj)^j,l ■ With the previous notation, 
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x j,l = gi(tj) + <oj,l an d we can establish that n _1 A?A; tends almost surely to £f =1 (/ tyigl) ■ 
This can also be written as n^Aj A, ~ Y4=\{ w \) 2 with (i£7|) i=1/ n wavelets coefficients of the 
functions g/ . 

If, for all Z = 1, . . . , p, gi is a polynomial function whose degree is less than or equal to the 
number of vanishing moments N of the wavelet mother, then this assumption holds. 

Hypothesis (A3) is not detailled here because even if it does not seem very constraining, it is 
difficult to study its feasibility. 

To conclude, when the design X,, i = l,...,n can be written as X, = gi + with g,- orthogonal 
polynomial functions with a degree less than or equal to N, and with centered independent 
random variables with finite variance, whenever J /g, = for all i, assumptions (Al), (A2) and 
(A4) hold. 

Appendix B. Proofs of the main results 
B.l. Preliminary result 
Proposition 2. 

When assumptions (A2) and (A3) hold, 

-4=EPA(0Oi + e f )Af = Op(A) 



This result comes from Bernstein's inequality applied to the random variables = -^^-^ — -, 
i = l,...,n, for any fixed / in {1,. . .,p}. Indeed, these variables are almost surely uniformly 
bounded and Ya=i E[V?y] is bounded, due to the following lemma: 

Lemma 3. If {AT) and (A3) hold, 

(i) n- 1/2 sup i=1 n || A ; - 1| -> 

(ii) n- 1 ^ „||A ; -|| 2 = 0(1) 

This result lies on the observation that ||A,|| 2 = Aj (A T A) 1/2 (A T A)~ 1/2 Ai, and consequently 
||A f || <n 1/2 \\(lA T A) l/2 \\h} /2 . 
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B.2. Variables transform 

Let us recall that we are studying the model 

Zi = Afp + eoi + ei under (A2)-(A4) (13) 

(Assumption (Al) is an identifiability assumption and does not intervene in the proofs). Following 
Huber (1981) or Bai et al. (1992), we build an equivalent model by a change of variables. Let us 
define the following transforms: 

R = A(A T A)- 1/2 , 

di = ^(0/ + e f ). 
The results may be established equivalently for the following model: 

Zi = Rjcco + di under (A2")-(A4"); (14) 

(A2") R T R = l r 

(A3") h = max Rj R; tends to 0. 

i—i ,...,n 

(A4") K% := E"=i RfRfE [p^ (di)] tends to non singular matrix. 
As the Huber cost function has scale transform properties: 

for any v > 0, p x (u) = v 2 p A/v (u/v), (15) 

we then can prove that in the model (14), the estimator k n is solution of the minimization problem 

n 

k n = argmin Y^P^i^i ~ Rjcc). 

DC i= i 

As p' x (u) = \p[(u/A) and p'[(u) = p'[{ulX), we have Kq ~ S _1 -Ko and Proposition 2 becomes in 
(14): 

Epi(df)Rf = Op(l). 

i=io 

In all the proofs, we will consider the model (14) and obtain the consistency results thanks to the 
mentionned transforms. 
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0. 



B.3. Convergence of the criterion 
Proposition 4. 

Let cbea strictly positive constant. Suppose (Al) to (A4) hold. Then, 

sup £ \Lti M% + e,- - Af (j6 - ft)) - p x (9 0i + e,-)) 

{||^-ft||< C AH-l/2} 

+ £ ^(flbi + £; -)Af()8 - ft) - 4(0 - ft) T K (£ - ft) 

*'=*o 

The proof is built on two phases: we first approximate the Huber cost function p with a smoother 
function, keeping a control on the third derivative; secondly, we develop a scheme of proof very 
similar to Bai et al. (1992) in the transformed model (14). The main argument is the convexity of p, 
which allows in particular the use of Rockafellar's theorems. 

B.3.1. Approximation of Huber cost function 

The approximation is built by three successive integrations. Let < 8 < 1. We define r\ on M: 



tj : u 



£(k- (1 -8/2))(u - (1 + 8/2)) til- 6/2 < \u\ < 1 + 8/2 
otherwise 



We introduce next, primitive of equal to zero at 1 + 8/2, r\ primitive of r\ equal to zero at 
and r$, primitive of rj equal to zero at 0. 

The function series p\ = r\ /n2 is a series of convex functions C 3 , which converges uniformly towards 
pi when n goes to infinity. We can furthermore prove that J \pf ^ \ < 12, and that 



"llpi-pilU — > °/ ( 16 ) 

n||pi-pi|U ^ 0, (17) 
IK-PiIU < 1. (18) 



Moreover, p" and only differ from each others on two intervals of length 1/n 2 . 
B.3.2. Preliminary tools 

Proposition 5. Let C be an open compact set ofM m . We consider (f n )neN an & f a family of convex 
functions defined on C and taking their values in a given probability space (Q, P, u). Suppose for all u G C, 
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f„(u) — f(u) converges in probability to 0. Then the convergence in probability o/sup| uGC | f n {u) — f(u) 
towards is acquired. 

Proof. We recall a theorem given in Rockafellar (1970) (Theorem 10.8, page 90): 

Proposition 6. Let C be an open compact set ofM. m . We consider (/ n ) n eN an & f a family of finite convex 
functions defined on C. Suppose the series (f n )neN converges simply to f on C. Then the convergence is 
uniform on C. 

In order to obtain a similar result for the convergence in probability, we may use the following 
characterization of such a convergence: 

Lemma 7. Let (X n )„ be a series of random variables and X a random variable. The series (X„) converges 
in probability towards X if and only if from all subsequence ofX„ we can extract a series which tends almost 
surely to X. 

Consider /„(„) a subsequence of /„. We would like to find j6(n), subsequence of v{n), such that for 
all u G C, /jg( n ) ( M ) — f( u ) 0- The Lemma 7 tells us that for all u G C there exists t] u (n) extraction 
of v(n) such that /« M ( n )( M ) ~~ f( u ) ~^ 0- Let us consider V = {uq, u\, Ui . . ■} dense and countable 
subset of C. Using a diagonal procedure, we can exhibit (]6(n)) such that for all u G T>, we have 
/jg( n )(w) — f( u ) 0. Afterwards, the convergence of /jg(„) — / on C holds by density of V and 
continuity of /jg(„) _ /• Applying Rockafellar's Theorem, we obtain that supf^^ (u) — f{u) tends 
almost surely to 0. 

To conclude, we have proved that from all subsequence sup/ v ( n ) (u) — f(u) of sup/„ (u) — f(u) we 

ueC uec 
could extract a series which converges almost surely to 0. This finishes the proof using Lemma 



7. 



□ 



B.3.3. Convergence criterion 

Let c > 0. We are going to prove that in model (14) we have: 



sup 

ll«ll<4 



0. 



l — lg i=\ 

Note that in the initial model (13), this is equivalent to 

sup £ |EL {px(6oi + e f - Aj(p - J8 )) - p A (0 O! - + e/)) 

{||^-ft||<cA»-i/2} 



+ £ p'A^i + £; -)Af()8 - jS ) - ni(j8 - jS ) r K ()S - ft,) 



(19) 
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We introduce: 

A(«) := f] ((5i(di - Rfa) - ^(rff) + p'^Rju) . 
The cost function p\ is convex. For every i, it gives the upper bound: 

pi(di - RJk) - (5i(rff) +pi(di)Rfa| < \p[(di - Rjct) - p\(di)\\Rj u\ 

This inequality gives a bound of the variance of A (a): 



(20) 



V«r(A(a)) < E E - Rfa) - p[(d t 

1=1 L 



iRfal 2 . 



The function ^ being 1-Lipschitz, 

Vn G N, Vz = l,...,n, Vu <E R+, E + u) - ^(d,)) 2 < « 2 . 

Consequently, 

Var(A(a)) < £ |Rfa| 4 < || a |j 4 £ ||R f || 4 . 

i=\ i=l 

As a is supposed to be bounded and Ya=i |R;| 4 — ^YL^i = hp tends to 0, we obtain that Var(A(oc)) 
tends to 0. Bienayme-Tchebychev inequality ensures then that | A (a) — EA(a) | converges towards 
in probability. 

The termEA(a). 

As the function p is C 3 , the Taylor expansion of degree 2 with a rest of an integral form of p\ on a 
neighborhood of d[ exists. It gives: 

p x (di - Rju) - p x {di) + piK)Rf« " \p"i (di)ct T RiR[u 

= -f$\m-t)\_^ m dt/6. 

Using the bound / pf\t) dt < 12, obtained when constructing p, we obtain: 



E 



£ (pi (rf/ - RJol) - pi(di) + p'^Rj cc - h'Kd^RiRjcc 



<2|kii 3 £||R ! -| 

z=l 



Note that E"=i ||R/|| 3 < h l/1 Zh = h 1/2 p -» 0. Therefore, when ||a < c, 

EA(a) = l a T K> + o(l), with^ = £ R,-RfE [p'lM)] ■ 



1=10 



Actually, K„ converges towards Kq. Let us decompose \\K'^ — Kq\\ in 



\y" y"\\ <r \\y" y"\\ _l \\y" y"\ 
\J^n — K II S ll^-n _ K n II + ll-^n ~~ *0 I 



30 



The convergence to of the second term is ensured by hypothesis (A3"). The first term is: 



The functions p" and p'[ only differ on intervals whose total length is 2/(n 2 ). Consequently 
E(p'{(dj) - p"(di)) < 2/(n 2 )\\p" - p'l\\oo\\M\co where f e denotes the density function of e t . We 
obtain the inequality: \\K% - K%\\ < \h l,1 C, with C a constant. As h tends to under (A4"), we 
deduce that X" converges towards Xq and thus EA(a) = ^o. t KqOc + op(l). 
When ||a|j < c, the convergence in probability of |A(a) — EA(a) | to implies: 



If D and D respectively denote D := Y2=\p\{&i ~ R[«) - Pi(d ; ) and D := LLiPi( d i ~ R f a ) ~ 
p\{di), then |D — D| < n||pi — pi||oo- Using (16), we obtain the almost sure convergence of 
D — D to 0. In the same way if B := E"=iPi(^') R f a and B := L"=l Pi( d i) R I a ' we then have 
|B — B\ < n\\p[ — p[ |joo|ja||/2 1/2 . When |ja|j < c, properties (17) implie that B — B tends almost 
surely to 0. All together, we have: 



are convex and the set {||a|| < c} is convex, compact and independent from n. Proposition 5 
completes the proof. 

B.4. Proof of Theorem 1 
B.4.1. Consistency 

In the model (14), we are willing to prove that &„ = Op(1)- Let c n — » oo. We may prove that 
P (||a„ || > c n ) — » 0. We can deduce from (19) that there exists a series c' n such that c' n — » oo, c' n < c n 
and 



^-x^^R^p'/KO-p'/^)). 





• We may prove now that the convergence is uniform on the set { ||tt || < c}. 



The functions in a: 




sup ^ 

{ 11*11 <c»} «=i 
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It is sufficient then to prove that P ( \\& n || > c' n ) — » 0. 
Suppose || a || = c' n . 
We have 

£ Ufa - Rjcc) - Pl (di)) = -f^p[(di)Rjcc + U T K%u + o P (l). 

!=1 V 7 1=1 Z 

First, 

Ili^H^i^OK) 2 , 

with s(Kq) smallest eigenvalue of Kq. As the matrix Kq is nonsingular, s(Kq ) > 0. Next, Proposition 
2 implies that 

lj£p;(d,)Rf«ii = Op(4)- 

z'=l 

As a consequence, the probability that the quantity 

n y 

J>i(di - R[ u ) - pi (di) = -J^p^Rja + —oJKqOc + o P (l) 
i=i z 

is negative tends to 0. This result is true uniformly for a verifying ||a || = c' n . We obtain: 

P L M Xp 1 (d i -Rjcc)-p 1 (d i )<o]^0. (21) 
\{et, \\*\\=c n } & J 



Let a be such that ||a| > c' n . 

We define t = e]0;l] and a' = fa. With the equality d; - R/V = (1 - t)d t + - Rjcc), 
together with the convexity of p, we have: 

Pl (di - R?V) - pi(rfi) < t - Rfa) - pi(di)) . 

As ||a'|| = c'„, it comes that: 

P ( inf £ pi (rfi - Rjcc) - pi (d f ) < ) 0, (22) 

\{0t, \\0C\\>c'„} i=1 J 

or equivalently: 



inf /„(«)< J„(0) ^0. 
{«■, \\cc\\>c n } J 

The estimator k n has been defined as the argument realizing the minimum of /„, and so, 
P (||ttn|| > c'„) tends towards zero, which achieves the proof. 
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B.4.2. Bahadur's representation 

We want to prove that in model (14), we have 



/ 1 n 



&n = K% J^p'^Ri +op(l). 



Let us first recall this result given in Rockafellar (1970): 

Proposition 8. Let C be an open convex set. Let f n be a family of differentiable convex functions and f be a 
differentiable convex function. If f n converges simply towards f on C, then Vf„ converges simply towards 
Vf on C and the convergence is uniform on every compact set of C. 

Similarly to Proposition 5, this Proposition can be generalized to a convergence in probability (using 
Lemma 7). 

Applying this Proposition to the result (19) gives us that, for all c > 0, 



sup 

Hall <c 



0. 



£U(d,-Rfa)R ; -pi(d ; )R ; ) +K%k ~0. 

z=l V 

We have proved precedently that k n = Op(1)- Then, 

£ U(di - Rj&„)Ri - p'tid^Ri) + K'J& n 
By definition of &„, Ya=\ Pii^i ~ Rj&n)Ri = 0. The convergence of (23) becomes: 

which is the announced result. 



(23) 



,!=1 



B.4.3. Asymptotic behavior of the functionnal part 

The model considered for this part of the proof is the model (13) contrarily to what precedes. 
Parseval equality gives: \\f n — /lb ~ nll^" - ^o||- We decompose this bound into: ^\\6 n — 9o\\ < 

h\e n -e n \\ + l\\e„-e \\ where 



hn = { 



Zi - AfjS if i < z' 

sign(zi - Afp ) (\zi - Af j6 1 - A) + if i > i 



D. Donoho (1992) proved that there exists a constant C such that E±||9 B - O || < C 1+2s • The 

convergence in L 2 implies the convergence in probability. 
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The term \\\6„ - 9„\\ verifies the inequality ±||0„ - 0„\\ < £||-A||||j6 n - j6 || + 2^. Assumptions (A2) 
and (A3) ensure that A \\ = Q E llA^ 2 ) 172 is bounded and that ||)8 n - jS || = Op(^) through 
the first part of the Theorem. Then, ±||0 n - 6 n \\ = Op(£) = OfC^tt^)- 
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